

Section: New Results

Model selection in Regression and Classification

Participants : Gilles Celeux, Mohammed El Anbari, Clément Levrard, Erwan Le Pennec, Lucie Montuelle, Pascal Massart, Caroline Meynet, Jean-Michel Poggi, Adrien Saumard.

Erwan Le Pennec is still working with Serge Cohen (IPANEMA Soleil) on hyperspectral image segmentation based on a spatialized Gaussian mixture model. Their scheme is supported by theoretical investigation [6] and has been applied in practice with an efficient minimization algorithm combining the EM algorithm, dynamic programming and model selection, implemented with MIXMOD. Lucie Montuelle is studying extensions of this model that comprise parametric logistic weights and regression mixtures.
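As a loose illustration of the mixture-based segmentation idea only, the sketch below fits a pixelwise Gaussian mixture by EM on an invented toy hyperspectral cube; the spatial logistic weights, dynamic programming step and MIXMOD implementation of the actual scheme are deliberately omitted, and all data and dimensions are assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(2)

# Toy "hyperspectral" image: 32x32 pixels, 5 spectral bands,
# with two spatial regions differing by their mean spectrum.
H, W, B = 32, 32, 5
cube = rng.normal(size=(H, W, B))
cube[:, :16, :] += np.linspace(1.0, 2.0, B)  # left half: shifted spectrum

# Pixelwise Gaussian mixture fitted by EM (no spatial regularization here).
pixels = cube.reshape(-1, B)
gmm = GaussianMixture(n_components=2, random_state=0).fit(pixels)
segmentation = gmm.predict(pixels).reshape(H, W)
```

On this toy cube the two mixture components recover the left/right split almost perfectly; the point of the actual spatialized model is to make the mixture weights depend on pixel position so that such spatial coherence is enforced rather than merely hoped for.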

In collaboration with Marie-Laure Martin-Magniette (URGV and UMR AgroParisTech/INRA MIA 518) and Cathy Maugis (INSA Toulouse), Gilles Celeux has extended their variable selection procedure for model-based clustering and supervised classification to deal with high-dimensional data sets, using a backward selection procedure which is more efficient than the previous forward selection procedure in this context. Moreover, they have analysed the differences between the model-based and geometrical approaches to variable selection for clustering. Through numerical experiments, they showed the advantage of the model-based approach when many variables are highly correlated. These variable selection procedures are used in particular for genomics applications, the result of a collaboration with researchers of URGV (Evry Genopole).

Caroline Meynet provided an ℓ1-oracle inequality satisfied by the Lasso estimator with the Kullback-Leibler loss in the framework of a finite mixture of Gaussian regressions model for high-dimensional heterogeneous data, where the number of covariates may be much larger than the sample size. In particular, she has given a condition on the regularization parameter of the Lasso under which such an oracle inequality holds. This oracle inequality extends the ℓ1-oracle inequality established by Massart and Meynet in the homogeneous Gaussian linear regression case. It is deduced from a finite mixture Gaussian regression model selection theorem for ℓ1-penalized maximum likelihood conditional density estimation, inspired by Vapnik's method of structural risk minimization and by the theory of model selection for maximum likelihood estimators developed by Massart.
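Schematically, and only to fix ideas (the constant, the penalty form and the remainder term below are placeholders, not those of the actual result), an ℓ1-oracle inequality of this type bounds the Kullback-Leibler risk of the Lasso estimator by an ℓ1-penalized oracle risk over the candidate conditional densities:

```latex
\mathbb{E}\!\left[\operatorname{KL}\!\left(s^{*},\,\hat{s}_{\mathrm{Lasso}}(\lambda)\right)\right]
\;\le\; C \,\inf_{s \in S}\Big\{\operatorname{KL}\!\left(s^{*}, s\right) + \lambda \,\|s\|_{1}\Big\}
\;+\; R(n,\lambda),
```

where $s^{*}$ is the true conditional density, $S$ the candidate class, and the stated condition on $\lambda$ is what guarantees that the remainder $R(n,\lambda)$ stays negligible.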

From a practical point of view, Caroline Meynet has introduced a procedure to select variables in model-based clustering in a high-dimensional context. To tackle the high dimension, she has proposed to first use the Lasso to select different sets of variables, and then to estimate the density by a standard EM algorithm, restricting the inference to the linear space spanned by the variables selected by the Lasso. Numerical experiments show that this method can outperform direct estimation by the Lasso.
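A rough schematic of this two-step idea, and not Meynet's actual procedure: here the Lasso step uses preliminary k-means labels as a pseudo-response to produce candidate variable subsets, and the BIC comparison across subsets of different dimensions is only illustrative. Data, regularization grid and subset scoring are all invented for the sketch.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import Lasso
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Toy data: 2 clusters separated on the first 3 of 20 variables.
n, p = 200, 20
X = rng.normal(size=(n, p))
X[:100, :3] += 3.0

# Step 1 (selection): a crude proxy for the Lasso step -- regress
# preliminary cluster labels on the variables with an l1 penalty and
# keep, for each regularization level, the support of the estimate.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
subsets = []
for alpha in (0.01, 0.05, 0.1, 0.3):
    coef = Lasso(alpha=alpha).fit(X, labels).coef_
    S = np.flatnonzero(coef)
    if S.size and not any(np.array_equal(S, T) for T in subsets):
        subsets.append(S)

# Step 2 (estimation): run a standard EM (GaussianMixture) restricted
# to each selected subset; BICs of different dimensions are compared
# naively here purely for illustration.
best = min(subsets, key=lambda S: GaussianMixture(2, random_state=0)
           .fit(X[:, S]).bic(X[:, S]))
print("selected variables:", best)
```

The benefit described in the text comes from step 2: once the support is fixed, the unpenalized EM estimate on the reduced space avoids the shrinkage bias of the direct Lasso fit.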

In collaboration with Jean-Patrick Baudry (Paris 6) and Margarida Cardoso, Ana Ferreira and Maria-José Amorim (Lisbon University), Gilles Celeux has proposed an approach to select, in the model-based clustering context, a model and a number of clusters in order to get a partition which both provides a good fit to the data and is related to external categorical variables. This approach makes use of the integrated joint likelihood of the data, the partition derived from the mixture model and the known partitions. It is worth noticing that the external categorical variables are only used to select a relevant mixture model; each mixture model is fitted by maximum likelihood from the observed data. Numerical experiments illustrate the promising behaviour of the derived criterion [29].

Since September 2008, Pascal Massart has been the cosupervisor, with Frédéric Chazal (GEOMETRICA), of the thesis of Claire Caillerie (GEOMETRICA). The project intends to explore and develop new research at the intersection of information geometry, computational geometry and statistics.

Tim van Erven is studying model selection for the long term. When a model selection procedure forms an integrated part of a company's day-to-day activities, its performance should be measured not on a single day, but on average over a longer period, such as a year. Taking this long-term perspective, it is possible to aggregate model predictions optimally even when the data probability distribution is so irregular that no statistical guarantees can be given for any individual day separately. He studies the relation between model selection for individual days and for the long term, and how the geometry of the models affects both. This work has potential applications in model aggregation for the forecasting of electrical load consumption at EDF.
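The long-term aggregation idea can be sketched with the classical exponentially weighted average forecaster, whose cumulative loss stays close to that of the best model in hindsight without any probabilistic assumption on the daily data. The two "experts", the learning rate and the synthetic load series below are all invented for illustration and are not part of the work described above.

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic daily "load" over a year, and two toy forecasters:
# a noisy-but-informed expert and a naive zero forecast.
T = 365
truth = np.sin(np.arange(T) * 2 * np.pi / 365) + 0.1 * rng.normal(size=T)
experts = np.stack([truth + 0.5 * rng.normal(size=T),  # noisy expert
                    np.zeros(T)])                      # naive expert

eta = 2.0                     # learning rate (assumed, not tuned)
w = np.ones(2) / 2            # uniform prior weights
agg_loss, exp_loss = 0.0, np.zeros(2)
for t in range(T):
    pred = w @ experts[:, t]              # aggregated daily forecast
    agg_loss += (pred - truth[t]) ** 2
    losses = (experts[:, t] - truth[t]) ** 2
    exp_loss += losses
    w = w * np.exp(-eta * losses)         # exponential weight update
    w /= w.sum()
print(agg_loss / T, exp_loss.min() / T)
```

Averaged over the year, the aggregated forecast is nearly as good as the better expert, even though on any individual day no such guarantee holds; this is the worst-case, long-term flavour of guarantee the paragraph refers to.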

Adrien Saumard has worked on the theoretical validation of the slope heuristics, a practical method of penalty calibration derived in a Gaussian setting by Birgé and Massart in 2006 and extended to bounded M-estimation by Arlot and Massart in 2010. He was able to prove the validity of this heuristics in bounded heteroscedastic regression with random design, when the considered models were linear spans of piecewise polynomials. A preliminary work on a fixed model was necessary and published in [9], while the validation of the slope heuristics itself, as well as the validation of a cross-validation approach, can be found in a preprint.
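A minimal numerical sketch of the slope heuristics, in a toy homoscedastic least-squares setting with piecewise-constant models rather than the bounded heteroscedastic framework with piecewise polynomials studied here; the data, the model collection and the threshold defining "large" models are all assumptions of the sketch.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy regression: y = sin(2*pi*x) + noise; model of dimension D is the
# least-squares piecewise-constant fit on a regular D-cell partition.
n = 500
x = rng.uniform(size=n)
y = np.sin(2 * np.pi * x) + 0.3 * rng.normal(size=n)

def empirical_risk(D):
    """Training least-squares risk of the D-cell piecewise-constant fit."""
    bins = np.minimum((x * D).astype(int), D - 1)
    fit = np.zeros(n)
    for b in range(D):
        cell = bins == b
        if cell.any():
            fit[cell] = y[cell].mean()
    return np.mean((y - fit) ** 2)

dims = np.arange(1, 101)
risks = np.array([empirical_risk(D) for D in dims])

# Slope heuristics: for large dimensions the empirical risk decreases
# roughly linearly in D/n; estimate that slope (the minimal penalty)
# on the largest models, then penalize by twice it.
large = dims >= 50
kappa = -np.polyfit(dims[large] / n, risks[large], 1)[0]
selected_D = dims[np.argmin(risks + 2 * kappa * dims / n)]
print("selected dimension:", selected_D)
```

The "factor two" between the minimal and the optimal penalty is precisely the phenomenon whose validity in the heteroscedastic setting is established in the work described above.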